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Abstract 



A nonparametric procedure for robust regression estimation and for 
quantile regression is proposed which is completely data-driven and adapts 
locally to the regularity of the regression function. This is ac-hieved by 
considering in each point M-estimators over different local neighbourhoods 
and by a local model selection procedure based on sequential testing. Non- 
asymptotic risk bounds are obtained, which yield rate-optimahty for large 
sample asymptotics under weak conditions. Simulations for different uni- 
variate median regression models show good finite sample properties, also 
in comparison to traditional methods. The approach is extended to image 
denoising and applied to CT scans in cancer research. 
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1 Introduction 



We consider a generalized regression model 

Yi = g{xi) + ei, i = l,...,n, 

with (ej) i.i.d., xi, . . . , in the design space and g : ^ ^ The problems 
we have in view are those of robust nonparametric estimation of g in the pres- 
ence of heavy-tailed noise (sj) and of nonparametric quantile estimation, which 
is becoming more and more popular in applications. One main application will 
be robust image denoising. In the spirit of classical M-estimation (Huber 1964) 
we therefore consider g{xi) as the location parameter in the observation Yi, that 
is 

g{xi) = argmia^giK E[p{Yi - m)] (1.1) 
for some convex function p : M ^ with p(0) = 0. We shall assume that g{xi) 



is uniquely defined by (1.1 ), which is true in all cases of interest. If the Yi have 
Lebesgue densities, then often an equivalent description is given by the first 
order condition E[/j'(ej)] = where p' denotes the (weak) derivative. Standard 
examples are p{x) = for the classical mean regression model (E[ej] = 0), 
p{x) = \x\ for the median regression model (P(ei ^ 0) = P(ei ^ 0) = 1/2) and 
the intermediate case p{x) = x^/2 for \x\ ^ k and p{x) = k\x\ — fc^/2 for |x| ^ k 
with some k > for the Huber estimator (E[min(max(ej, — A;), A;)] = 0). The 
quantile regression model is obtained for p(x) = \x\ + (2a — l)x (P(ei ^ 0) = a 
with quantile a G (0,1)), see e.g. Koenker (2005). Since we shall care about 
robustness, we merely assume a mild moment condition G for some r ^ 1 
and measure the error in L'"-norm. 



The function g is not supposed to satisfy a global smoothness criterion, but we 
aim at estimating it locally in each point x E ^ as efficiently as possible. The 
risk will then depend on local regularity properties, which we do not assume to 
be known. For spatially inhomogeneous functions, in the presence of jumps or for 
image denoising pointwise adaptive methods are much more appropriate than 
global smoothing methods. In classical mean regression local adaptivity can be 
achieved using wavelet thresholding or kernels with locally varying bandwidths, 
see Lepski et al. (1997) for a discussion. In this ideal situation a data-driven 
choice among linear empirical quantities is performed. M-estimators are typi- 
cally nonlinear and the standard approaches do not necessarily transfer directly. 
Brown et al. (2008), for example, use an intermediate data binning and then 
apply wavelet thresholding to the binned data for median regression. On the 
other hand. Hall and Jones (1990), Portnoy (1997) and van de Geer (2003) con- 
sider kernels, smoothing splines and more general M-estimation for quantile 
regression, but they all use global methods for choosing the tuning parame- 
ters like cross-validation or penalisation. Here, we develop a generic algorithm 
to select optimally among local M-estimators. In contrast to classical model 
selection, we do not only rely on the estimator values themselves to define a 
data-driven selection criterion. This has significant advantages in the present 
case of nonlinear base estimators. 
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Subsequently, we assume that the statistician has chosen the suitable definition 
of p for the problem at hand and we use the corresponding sample versions 
to construct base estimators for the (generalized) regression function g. In the 
spirit of classical nonpar ametrics, we assume that g is locally almost constant 
around a given point x G J?^. The statistical challenge is to select adaptively 
the right neighbourhood U ol x where a local M-estimator is applied. Let us 
write 

m{Yi, Xi £ U) := arginf^gjj | ^ p{Yi - /i)| (1.2) 

for the location estimator on the set [/ C If the minimizer is not unique, we 
just select one of them (e.g., a version of the sample median for \U\ even). Note 
that an extension to general local polynomial or more general local-likelihood 
estimation is straightforward, but this is not the focus of the present work. For 
each point x let a family of nested neighbourhoods Uq Ui C ■ ■ ■ CI JJk be 
given and set 

dk ■■= m(Yi, Xi £ Uk). 

Then the family {'&k)o^k^K forms the class of base estimators and we aim at 
selecting the best estimator of ?? := g{x) in this family. 

1.1 Example. Let the design space be = [0, 1] with equidistant design 
points Xi = i/n and take p{x) = \x\. Consider the symmetric windows Uk = 
[x — hk-,x + hk] generated by some bandwidths ^ /iq < /ii < • • • < h^. Then 
'dk is the classical median filter, see e.g. Truong (1989) or Arias-Castro and 
Donoho (2006). 

Using Lepski's approach as a starting point, we present our procedure to se- 
lect optimally among local M-estimators in Section |2] We argue in a multiple 
testing interpretation that our procedure is usually more powerful. Moreover, 
it is equally simple to analyze and easy to implement. In Sections |3] and |4] we 
derive exact and asymptotic error bounds and the latter give optimal minimax 
rates for Holder classes. The simulations in Section [5] show that our procedure 
has convincing finite sample properties. Moreover, they confirm that Lepski's 
classical method applied to local median estimators suffers from oversmoothing 
because changes in the signal are not detected early enough due to the robust- 
ness of the median. Finally, the procedure has been implemented to denoise 
dynamical CT image sequences in Section [6] which is of key interest when as- 
sessing tumor therapies. Two more technical proofs are postponed to Section 

m 

2 The procedure 
2.1 Main ideas 

As a starting point let us consider the standard Lepski (1990) method for se- 
lecting among {'dk)o^k^K, given the mean regression model with K[ei] = and 
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E[e?] < oo. Note that the base estimators are then ordered according to de- 
creasing variances: Var(i?fc) ^ Var(i?fc_i). On the other hand, the bias is usuahy 
increasing with increasing neighbourhoods Uk- This is not always the case (for 
example, think of local means for a linear g), but true in particular for the 
worst case bias over smoothness classes like Holder balls of functions. Lepski's 
method can be understood as a multiple testing procedure where the hypothe- 
sis Hoi^k) : g\iji^ = 1) that g is constant on is tested against the alternative 
of significant deviations. Always assuming that Hq(0) is true, we test sequen- 
tially whether H{){k + 1) is acceptable provided that the hypotheses Ho^i) have 
been accepted for all £ ^ k. Once the test of an HQ{k + 1) is rejected, we se- 
lect the base estimator "i^^ corresponding to the last accepted hypothesis. The 
main point is thus to properly define the single significance tests for Hg^k + 1). 
Lepski's method accepts HQ(k + 1) if {"^k+i — "^el ^ ^e'^^^ holds for all i ^ k 
with suitable critical values > 0. The wide applicability and success of 

Lepski's method is also due to this very simple and intuitive test statistics. 

In our nonlinear estimation case it turns out that tests for Ho^k + 1) based 
on the differences of base estimators are often not optimal. To understand 
this fact, let us consider a toy model of two neighbourhoods Ui C U2 with 
a piecewise constant median regression function g equal to fii on Ui and to /i2 
on C/2 \ t^i- The procedure therefore reduces to a simple two-sample location 
test between the observations in Ui and in U2\Ui. We proceed by considering 
abstractly a two-sample location test where the first sample Yi,. . . ,Yn is i.i.d. 
with density fi{x) = 2^exp(— |x — ^il/o") and the second independent sample 
Yn+i, . . . ,Y2n is i.i.d. with density f2{x) = 2^exp(— |x — fi2\/o'). Our goal is 
to test Hq ■ fJ-i = 1^2 for known cr > 0. Given the Laplace distribution, we 
follow Lepski's idea and put rhi = med(li, i = 1, . . . ,n), the median over the 
first sample, and 1112 = med(y^, i = 1, . . . ,2n), the median over both samples. 
Then the test rejects if Tl := 2\rhi — 7712] > z holds for appropriate z > 0. 
A more classical approach, though, relies on a likelihood ratio (LR) test or on 
a Wald-type test using the maximum likelihood estimator for fii — /X2- Since 
the LR test is not as simple, we focus on the Wald-test statistic which is given 
by the difference of the medians over the two samples. Hence, we reject Ho 
if Tw := |med(yj, i = 1, . . . , n) — med(yi, i = n + 1, . . . , 2n)| > z holds for 
appropriate z > 0. The following asymptotic result for the two test statistics is 



proved in Section 7.1 



2.1 Proposition. Let / : M ^ be a symmetric and continuous density 
with /(O) > and let Yi, . . . ,Yn f, y„+i, . . . , l2n ~ /(• - A) with A > 
be independently distributed. Then with F denoting the cumulative distribution 
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function of f we obtain for n ^ oo 

n(med{Yi, i = n + 1, . . . ,2n) - med{Yi, i = 1, . . . ,n) - A) iV(0,cr^ 



w) 



n(2(med(yi, i = l,...,2n) - med(yi, i = 1, . . . , n) ) - A ) ^ ^^(O,^^) 



2 _ 2F(A/2)(1 - F(A/2)) ,1 2(1 - F(A/2)) 
^" /2(A/2) +/2(0) /(0)/(A/2) • 

In particular, for IS. = Q we have a\ = o"^ antf for A ^ we /lawe t/ie order 
a\ = Cyj/Cl + 2A/(0) + O(A^/(0))), provided f is Lipschitz continuous at zero. 



Putting A = l/xi — /U2I this result shows that under Hq, i.e. A = 0, the test 
statistics and T]y are asymptotically identically distributed, whereas has 
a larger asymptotic variance under any alternative A > than Tw. In the 
above Laplace model with densities /i,/2 this deterioration is only negligible 
if the signal-to- noise ratio satisfies — /U2I/0" <C 1. This is exactly what we 
see in simulations, see e.g. Example 1 in Section |5] below. Since the Laplace 
model is Hellinger differentiable, the Wald-type test is (locally) asymptotically 
efficient for n —> 00 as is the LR test, see e.g. van der Vaart (1998). Strictly 
speaking, when considering local alternatives for fixed a > and n — > 00, i.e. 
\fJ'i — fJ'2\ = 0{n^^^'^), then the deterioration in using Tl becomes also negligible. 
From a practical perspective, these local asymptotics are often not adequate, 
e.g. in image denoising, where we face relatively large signal differences A at 
borders between objects and do not dispose of a very large number n of observed 
pixels. 

More generally, two-sample location tests can naturally be based on the differ- 
ence of the in-sample location estimators. In consequence, we proceed differently 
in testing the hypotheses HQ{k + l) of homogeneity: When the hypotheses -ffo(^) 
for £ ^ k have been accepted, we ask whether the observations Yi in the new 
points Xi G Uk+i \ Uk are homogeneous with those in Ue for i ^ k. This means 
that our tests reject if the empirical location in the additional data 

^^(fc+i)\fc := m{Yi, Xi E Uk+i \ Uk) 

satisfies with certain critical values z^'^'^^^ > 0: 

3£^k: \^ik+i)\k-^i\>4''''^. 

As in Lepski's method, it is necessary to perform the testing for all i ^ k and not 
only with i = kto avoid that the signal slowly drifts away as the neighbourhoods 
grow. In most cases, though, i/o(^ + 1) will be rejected because the new piece 
'd(^k+i)\k is not in line with due to the smaller variance of -dk compared to 
'di, I < k, this last test is the most powerful. It is then interesting to observe 
that for linear m the test statistic 'd[k+i)\k ~ is just a multiple of "i^/t+i — 'dk- 
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Consequently, for mean regression with linear base estimators our method will 
not differ much from Lepski's standard method, whereas the general nonlinear 
M-estimators are treated in a significantly different way, note also the numerical 
results in Section HI 

Observe that our approach breaks an ubiquitous paradigm in modern statistics 
and learning theory (see e.g. Massart (2007) for model selection or Tsybakov 
(2004) for aggregation): we select the best base learner among {'dk) in a data- 
driven way not only based on the estimator values themselves, but additionally 
on the statistics (i?(fc+i)\fc)- Not only in the abstract modeling above, but also 
in implementations this idea turns out to be very advantageous for nonlinear 
estimators. 



2.2 The algorithm 

We want to select the best estimator among the family {t?^ | A: = 0, . . . ,K}. 
Considering the law Pq generated by the no-bias setting g = 0, we introduce 
the stochastic error levels 

s, := Eo[|^,^]l/^ Sk, := M\^ik+i)\k " m'^'- (2.1) 

We apply the following sequential procedure for prescribed critical values 
(2;j)j=o,...,x-i and set zk '■= 1: 

• initialize k := 0; 

• repeat 

if for all j = 0, . . . ,k 

l''^(fc+l)\fc - ^ ZjSkj + Zk+lSk+l 

then increase k 
else stop 
until k = K ; 

• put k := k and -d := "df^. 

This algorithm to determine k can be cast in one formula: 



:= inf U > 



3 Error analysis 

3.1 Propagation and stopping late 

We need a very natural property of the M-estimator. 
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3.1 Assumption. The location estimator in (1.2) satisfies for any set S and 

any partition S = IJ^ Sj with pairwise disjoint sets Sj : 

miiij m(Yi, Xi G Sj) ^ m{Yi, Xi £ S) ^ maxj m{Yi, Xi G Sj). 



3.2 Lemma. If the function p is strictly convex, then Assumption 3.1 is satis- 
fied. 

Proof. Let us write niT as short-hand for m(Yi, Xi £ T), T ^ . Denoting 
by p'j^ , p'_ the right- and left-handed derivatives of the convex function p, the 
functions p'_^, p'_ are strictly increasing with p'^{x) < p'-{y) ^ p'+iu) foi' all 
X < y and 

p'-iy^-^T) ^0, p'+{Y^ - mr) > 0. 

If < were true for all j, then 
Y p'-iYi-ms) > 

Y E PW-ms,)>0, 

Xi^S j X'i^Sj 

which contradicts the minimizing property of ms. Hence, ms ^ miuj ms- holds 
and a symmetric argument shows nig ^ max^ msj • □ 

3.3 Remark. If p is not strictly convex, then we usually impose additional 
conditions to define m uniquely. For any reasonable specific choice Assumption 



3.1 should be satisfied. In particular, this is true for the sample median where 
we take for an even number N of data points Yi the mean (Y(^j\f/2) +^(i+Af/2))/2 
of the order statistics. 



3.4 Proposition. Grant Assumption 3.1. Then we have for any k = 0, . . . , K— 
1 

\^ - -i?fc|l(A: > k) ^ max (zkSjk + Zj+iSj+i). 

j=k+l,...,K—l 

3.5 Remark. This error propagation result is true 'w-wise', that is, it does 
not depend on the noise realisation. It is built into the construction of the 
selection procedure. An analogous result holds for Lepski's original procedure 
(Lepski 1990, Lepski et al. 1997). 



Proof. From Assumption 3.1 we infer for i > k 

We - ^ ^ max^Jt?,-\(,_i) - ^k\. 

We therefore obtain on the event {A; > A;} by construction 

max - max (zkSjk + Zj+iSj+i). 

j=k+l,...,k j=k+l,...,K-l 



□ 
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3.6 Example. For geometrically decreasing stochastic error levels Sk in (2.1), 



in particular for the median filter from Example 1.1 with bandwidths hj. = h^q , 
we have Sjk ^ Sfc for j > k, where A < B means A = ff{B) in the ^-notation. 
The late stopping error is of order -z^s^, provided the critical values (z^) are 
non-increasing. This will imply that the error due to stopping later than some 
optimal k* is increased by at most the order of zj,, : 

E^[i^ - i?ri(^ > k*)] < E4\h* - ^?r] + 4*4- ^ (1 + zi,)E4\^k' - m- 



3.2 Critical values and stopping early 



As the preceding analysis shows, small critical values (zfc) lead to small errors 
caused by stopping late. On the other hand, the (z^) should not be too small 
in order to control the error of stopping early. To this end, we shall require a 
condition on the critical values (z^) in the no-bias situation under ¥q, that is 
for constant g = 0. In fact, we face a multiple testing problem, but with an 
estimation-type loss function. For some confidence parameter a > we select 
Zk > 0, k = 0, . . . , K — 1, such that the condition 



j=0 



> zesjt) 



^ as 



K 



(3.1) 



is satisfied. In order to obtain a unique prescription for each Zk that equilibrates 
the errors for different stopping times of the algorithm, we can select the (z^) 
sequentially. We choose zq such that 



j=0 

and then each z^ for given zq, . . . , Zfc-i such that 



K-l 



|i9jri(|^?(j+i)\j-??fc| > ZkS.k, y£<k: 



(3.2) 

To determine the (zk) in practice, we simulate in Monte Carlo iterations the 
pure noise case g = and calculate for each k the error when the algorithm 
stops before the (theoretically optimal) index K due to a rejected test involving 
Zk- The critical values are determined such that this error is a fraction of the 
oracle estimation error s^. For this calibration step the original algorithm of 
Section 2.2 is taken, only modified by using ZjSkj instead of ZjSkj + Zk+iSk+i 



in the testing parts. 

The selection rule for the critical values in Lepski's procedure is the focus in 
the work by Spokoiny and Vial (2009). Their idea is to transfer properties from 
the no-bias situation to the general nonparametric specification by bounding 
the likelihood between the two observation models. This approach, the so-called 
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small modeling bias condition, could be applied here as well and will give similar 
results. On a practical level, the difference is that Spokoiny and Vial (2009) 
enlarge the moment from r to 2r in the calibration step, while we add the term 
Zfc+iSfc+i to the testing values zjSkj from the calibration. In the asymptotic 
analysis, however, the method by Spokoiny and Vial (2009) costs us some power 
in the logarithmic factor and we would thus not attain optimal rates over Holder 
balls, cf. Section |4] Moreover, for robustness reasons, we do not want to require 
higher moment bounds for the error variables and the likelihood. 

3.7 Definition. Given the regression function g, introduce its variation on Uk 

Vfc(5f) := sup \g{yi) - g{y2)\ 

and consider the oracle-type index 

F := min{A; = 0, . . . , - 1 1 Yk+iio) > ^fc+iSfe+i} A K. 



This definition implies that for all k ^ k* the maximal bias Vk{g) of -dk is less 



than its stochastic error level Sk from (2.1) times the critical value z^. The 
next result, when specialised to A; = A;*, means intuitively that the error due 
to stopping before k* can be bounded in terms of the stochastic error of 'ffk* j 
involving the critical value factor. Let us also mention here that the 



rationale for the choice z/^ = 1 in the algorithm of Section 2.2 is to equilibrate 
maximal bias and stochastic error at step k = K — 1. 

3.8 Proposition. We have for any k = 0, . . . ,k* 

E - 4ri(fc < k)] ^ V 1)(4 + 1 + a)sl. 

Proof. We shall write k{g), ^kid) etc. to indicate that k, etc. depend on the 
underlying regression function g. We shall need the inequality 

WAa) - h{9)\ ^ - 4(0)1 + Vk{g) for j < k (3.3) 

which follows from 

^jia) - ^kig) = 'm{g{xi) + Ei, Xi £ Uj) - m{g{xi) + Ei, Xi G Uk) 

^ m{Ei, Xi £ Uj) + sup g{x) - m{Ei, Xi £ Uk) - inf g{x) 

^M0)-M0) + Vk{9) 
and by a symmetric argument for '&k{g) — ^jis)- 



By definition of k* and using the condition on the (zk) as well as (3.3) for 
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and we obtain for all k ^ k* 

E[\^g)-M9)rm9)<k)] 

k-l 

= Y,^[\M9) -M9)rm9) = j)] 
k-l 

^ [{\,{g) + 1^,(0)1 + \MOWm9) = j)] 

j=Q 

^{3'-'\/l)[\,{gy+E[\M0)n + 

k-l 

+ [|7?,-(0)ri(3£ ^ j : \^ij+i)\j{g) - M9)\ > zes.i + zj+^Sj+i 

j=0 

^(3-Wi)(44 + 4+ 

k-l 

+ [|i?,-(0)ri(3£ ^ j : |^o-+i)\j(0) - + Vj+i{g) > z,s,, + zj+is^+i 

j=0 



The result follows from the isotonic decay of (sfe). □ 
3.3 Total risk bound 

3.9 Theorem. Assume that {zkSk) is non-increasing in k. Then under As- 



sumption 3. 1 the following excess risk estimate holds for all k ^ k*: 



\ j=k+l,...,K—l ■' / 



Proof. For the late-stopping error Proposition 3.4 and the decay of {z^Sk) give 



\^-^kri{k >k)^ (2^-ivl)max(4.^^,+4+i.5+i) ^ {r-^Vl)zl{sl+maxs^^,) . 
Add the early-stopping error from Proposition |3.8[ □ 



3.10 Example (continued). For geometrically increasing bandwidths {hk) we 
obtain Sj^ ^ Sfc for j > k and thus 

iE[i^-^fc*n<(«+404*- 

The factor a -\- 4* is the term we pay for adaptation. 
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4 Asymptotic risk 



4.1 General result 

We shall derive convergence rates for n — oo of the critical values (zk)- All 
quantities in the procedure may depend on n, but we still write U^, K and 
instead of Uk{n), K(n), Zk{n). The notation A < B will always mean A{n) ^ 
cB{n) with some c > independent of n and A ~ i? is short for A < B and 
B < A. We work under the following assumption whose validity under mild 
conditions will be derived in the next subsection. 

4.1 Assumption. 

(a) The cardinalities of the neighbourhoods Uk grow with geometric order: 

qiNk ^ Nk+i ^ q2Nk for all k = 0, . . . K - 1 

for some fixed q2 ^ qi > 1 and with Ni/ log{NK) — > oo, Nk ~ n as 
n — > oo. 

(b) For all sufficiently large N we have 

E[|m(e„ i = 1, . . . , N)^^/^ ~ E[|m(e„ i = 1, . . . , iV)|2^]i/2r ^ ^-1/2^ 

(c) For all r^r — > 00 with t^N'^/"^ — > a moderate deviations bound applies: 
there is some c > such that 

limsupe'^^^P (7V^/^|m(ei, i = l,...,N)\ > tn) < 00. 

Af^oo 

The following asymptotic bounds follow directly from the definitions: 



— 1 /2 —1/2 

4.2 Lemma. Assumption 4-l(^) implies Sj ~ N- and N- A {Nk+i 



NkY^^"^ < Skj < Nj V {Nk+i - Nk)"^'"^. Assumption 4.1(a) then yields for 

Sj ~ Skj ~ A^/^^^ 



Under Assumption 4.1 critical values of the same order as in the Gaussian case 
suffice. 



4.3 Proposition. Grant Assumption 4-1 cLnd suppose a E (0,1). We can 

choose 

zl = C{2r\og{sk/sK) + log(a-^) + log(i^)), A: = 0, . . . , K - 1, 



with C > a sufficiently large constant in order to satisfy Condition (3.2). For 
K ~ log n this yields asymptotically Zk ~ \/log n. 
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Note that the chosen critical values Zk are decreasing in k, which has the de- 
sirable effect that we do not permit stopping at an early stage with the same 
probability as stopping at higher indices k. Moreover, this guarantees that ZkSk 
is non-increasing in k, the hypothesis in Theorem 3.9, From Theorem 3.9 we 
therefore obtain the following asymptotic risk bound. 



4.4 Corollary. Grant Assumptions 3.1 and 4-1 cL^d let K ~ logn. Choosing 
the critical values as in Proposition \4-S\ gives 

E[|T?-^r]<(iognr/2E[|4--^r]. 

4.5 Example (continued). Let us specify to s-Holder continuous g : [0, 1] — > R, 
equidistant design and kernel estimators with geometrically increasing band- 
widths hk = hoq^, K ~ log(n). Then we can choose ~ y^log(n) and the index 
k* satisfies \k*{ff ~ hf, ~ log(n), that is hk* ~ (log(n)/n)i/(2«+i) 
and Zk*Sk* ~ (log(n)/n)**/^^*"'""'^^. This is the classical minimax rate for point- 
wise adaptive estimation in the one- dimensional s-Holder continuous case, see 
Lepski et al. (1997) for the Gaussian case. Here, we have derived the same rate 
for pointwise adaptive M-estimation under very weak conditions on the error 
distribution, compare the discussion on specific models below. Let us also men- 
tion that Truong (1989) obtains the same rate result, but without logarithmic 
factor, for the non-adaptive median regression case. 



Proof of Proposition 4-3. Let j ^ k. For n sufficiently large Assumption |4.1[ c) 
together with the asymptotics ZkSjk < (log(A''/<)iV^^)-'^/^ (using Assump- 



tion |43|a,b) and Lemma |4.2[ ) yields 



> ZkSjk/'^) + ^o{Wk\ > ZkSjk/2) 



< exp{-czls]k{Nj+i - Nj)/4) +exp{-czls]kNk/4). 



By Lemma 4.2 there is another constant c' > such that for large Zk 
^o{\^{j+i)\j - ^k\ > ZkSjk) < exp{-c'zl). 

Our choice of Zk with C, sufficiently large guarantees exp(— c'z^/2) = 
o{a{sK / SkY for large K. We therefore more than satisfy (3.1) and the 
construction in ( |3.2| ) provided n is sufficiently large: 

K-l 



j=k 



j+ 



j=k 
K-l 

<Y,s]eM-czl/2) 

j=k 

= o{{K-k)sla{sK/skyK-^) 
\ K 
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For K ~ logA^ we obtain log{NK/Nk) ^ {K — k)logq2 < logA^ and thus 



z? ~ log n. □ 



4.2 Specific models 



The preceding asymptotic analysis was based on Assumption 4.1 where part (a) 
can be ensured by construction whereas parts (b) and (c) depend on the noise 
model and the choice of M-estimator. The most severe restriction will usually 
be the moderate deviation property of Assumption |4.1[ c). In the case where 
the law of the error variable Ei is absolutely continuous, this property holds by 
Corollary 2.1 in Arcones (2002) under the following conditions: 

(a) E[p{e^ + h)- p{£i)] = Vh^ + o{h?) for some F > and |/i| ^ 0; 

(b) p is Lebesgue-almost everywhere differentiable with derivative p'; 

(c) there are A,(5 > such that E[exp(A|/3'(ej)|)] and E[exp(A sup|;j|<g5|/9(ej + 
h) — p{£i) — hp' {ei)\/h)\ are finite. 

For mean regression p{x) = we have V = 1 and p'{£i) = 2ej such that a finite 
exponential moment for £i is required. For median regression the result applies 
with y = /e(0)/2 and p'{£i) = sgn(ej) and because of + — |ej| — /isgn(ej)| ^ 
2h no moment bound is required. The same is true for any robust statistic 
with bounded influence function, in particular for the Huber estimator and 
general quantile estimators. Arcones (2002) discusses that an exponential tail 
estimate for p'{£i) is also necessary to obtain a moderate deviation bound, which 
might be a serious drawback when using Lepski's method with linear non-robust 
estimators. 

For the median the requirements are not difficult to verify directly. Assumption 
|4.1[ b) is for example established by Chu and Hotelling (1955), who show that 
for /e continuously differentiable around zero, /e(0) > 0, r G N and Z ~ iV(0, 1): 

hm iV^E[med(ei, . . . ,e,v)'l = (2/,(0))-^' EiZ^^. 



Using a coupling result, we can establish Assumption 4.1 ^b,c) under even more 



general conditions, see Section 7.2 for a proof: 



4.6 Proposition. Assume that the £i have a Lebesgue density fe which is 
Lipschitz continuous at zero and satisfies j^^f^{x)dx = 1/2, /e(0) > 0, 
E[|ej|''] < CO. Noting med(e) := med(ei, . . . ,£n), N odd, we have 

ViV ^ 5 : E[|med(e)|n ~ N'''^^ and E[|med(e)|2'-] ~ iV"''- 

as well as for rjv ^ oo with tn = o{N^/'^) 

limsupP (2iV^/25e(0)|med(e)| > tn) exp(r^/8) ^ 2. 
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regression function and data in Example 1 a 



IVIC errors in Example 1a 




Figure 1: Example 1 with Laplace noise: A typical realisation and a box plot of 
the sample errors in 1000 Monte Carlo runs. 



5 Simulation results 



We illustrate our procedure by an implementation for median regression on 
^ = [—1,1] and the estimation of the regression function at x = 0. We simulate 
n = 200 equidistant observations (Yi) with standardized errors (ej) (E[ej] = 0, 
Var(ej) = 1) that are (a) Laplace, (b) normal and (c) Student t-distributed 
with three degrees of freedom. The location is each time estimated by local 
sample means as well as by local sample medians. As neighbourhoods we take 
symmetric intervals C/^ around zero containing [5*^/4'^"^] data points. This gives 
K = 17 different base estimators. 



The calibration of the procedure is performed for Laplace distributed errors 
with r = 2 and a = 1. The variances Sj , Sjk of the sample means are calculated 
exactly and those of the sample medians are approximated by their asymptotic 
values (which are quite close to Monte Carlo values). The critical values {zk) 



are chosen according to the prescription in (3.1 ). This is achieved in both cases, 
mean and median estimators, by using the choice in Proposition 4.3 with values 
that are calibrated by 10000 Monte Carlo runs for the pure noise situation. 
It turned out that this gives almost equally sized error contributions for the 



different values Zk, as postulated in (3.2). The same calibration principle was 



applied for the original Lepski procedure with mean and median estimators. 

As a first example we take a simple change point problem by considering the 
regression function g(x) = for |x| ^ 0.2 and g{x) = 2 for > 0.2, which can 
be considered as a toy model for edge detection in image restauration or for 
structural breaks in econometrics. In Figure 1 we show a typical data set in the 
Laplace case (a) together with box plots for the absolute error of the different 
methods in 1000 Monte Carlo repetitions: local means with Lepski's and with 
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MC errors in Example 1b 



MC errors in Examplelc 
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Figure 2: Box plot of the sample errors in 1000 Monte Carlo runs for Gaussian 
(left) and Student t(3) noise (right). 



our method, local medians with Lepski's and with our method and the oracle 
method, which is just the sample median over [—0.2, 0.2] = {x : g{x) = 0}. For 
exactly the same methods, especially still calibrated to Laplace errors. Figure 
2 presents the results for Gaussian and heavy-tailed Student t(3) errors. 

It is obvious that in all cases Lepski's method applied to sample medians as 
base estimators works quite badly. This is due to the fact that this method 
stops far too late: the sample median over the complete intervals does not 
really 'notice' the jump in the data. In fact, in the Laplace simulation study the 
oracle fc = 10 is selected by this method in less than 1% of the cases while most 
often (65%) the selection is A; = 12 which yields the 1.5 times larger window 
Ui2 = [—0.29,0.29]. The methods using the sample mean estimators perform 
reasonably well and especially both very similarly. Still, they are clearly beaten 
by our median based procedure in cases (a) and (c) where the median is the 
more efficient location estimator. It is remarkable here that we nearly achieve 
the risk of the oracle median estimator. Even in the Gaussian case (b) the 
linear procedures have only minor advantages. Finally, we notice the robustness 
property that the calibration with the wrong error distribution in Figure 2 does 
not seriously affect the results. 

In a second example we consider the smooth regression function g{x) = 2x{x + 
1). Because we are estimating locally around a; = 0, this is a caricature of a 
C^-function with ^'(0) = 2 and ^"(O) = 4. Figure 3 shows again a typical data 
set and boxplots for the different methods in 1000 Monte Carlo runs under 
Laplace errors. This time the oracle choice is the window [—0.39,0.39]. Our 
median based procedure outperforms the others where the advantage over the 
mean-based approaches is again mainly due to the relative efficiency gain of size 
l/\/2 induced by the base estimators in the Laplace model. This gain, though. 
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regression function and data in Example 2a 



IVIC errors in Example Zb 
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Figure 3: Example 2 with Laplace noise. A typical realisation and a box plot of 
the sample errors in 1000 Monte Carlo runs. 



is not at all visible when using Lepski's method for selecting among the sample 
medians. The results for the error distributions (b) and (c) resemble those of 
the first example, we confine ourselves to summarizing the numerical results for 
all examples in the following table, each time stating the Monte Carlo median 
of the absolute error: 



Ex. 


Mean Lepski 


Mean RR 


Median Lepski 


Median RR 


Median Oracle 


la 


0.1446 


0.1450 


0.2871 


0.0897 


0.0763 


lb 


0.1640 


0.1630 


0.2795 


0.1647 


0.1325 


Ic 


0.0982 


0.0978 


0.3012 


0.0596 


0.0560 


2a 


0.1846 


0.1924 


0.3051 


0.1246 


0.1005 


2b 


0.1808 


0.1886 


0.3430 


0.1586 


0.1241 


2c 


0.2102 


0.2126 


0.2455 


0.1047 


0.0822 



Further simulation experiments confirm this picture. Especially for lower values 
of the moment r our median-based procedure is very efficient, while sometimes 
for r = 2 the mean-based procedures profit from less severe outliers in the 
Monte Carlo runs. In all these experiments the location is equally described by 
mean and median and we mainly see the efficiency gain of the sample median 
for non-Gaussian noise. For general quantile regression, however, linear methods 
do not apply and the standard Lepski procedures based on the nonlinear base 
estimators will perform badly. Our approach gives significantly better results. 
The error reductions by a factor of two and more, achieved in the median 
procedures above, confirm this very clearly. 
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Figure 4: CT scan of the upper abdomen, original and result of denoising 



6 Application 

The proposed procedure is applied to denoise images used in the surveillance 
of cancer therapies. In Dynamic Contrast Enhanced Computer Tomography 
(DCE-CT) a contrast agent is injected in the human body and its diffusion over 
time is observed which is specific for different kinds of cell tissues and allows 
thus the surveillance of cancer therapies. For medical reasons the dose of con- 
trast agent is kept small which leads to a poor signal-to-noise ratio. An analysis 
of residuals shows that the observational noise is well modeled by the Laplace 
distribution. Moreover, sometimes human movements produce significant out- 
liers. Therefore local median estimation is employed. Especially for dynamical 
image sequences, the denoising is remarkably successful when the same spatial 
neighbourhoods are used over the whole observation period. This means that 
at each voxel location Xi a vector-valued intensity function g : ^ ^ is 
observed under vector- valued noise £«. The vector g{xi) encodes the intensity 
at time points (ti, . . . recorded at spatial location Xi. Our previously de- 
veloped procedure perfectly applies to this situation, we just need a testing 
procedure between vector-valued local M-estimators. 

Details of the experimental setup and the estimation procedure are discussed in 
Rozenholc et al. (2009) and we merely give a rough description of the setting. A 
multiresolution test procedure is applied to compare different vector estimates. 
In a first pre-selection step for each voxel Xi we disregard voxels that are signif- 
icantly different from Xi and construct then circular neighbourhoods around Xi 
consisting only of non-rejected voxels. This allows geometrically richer neigh- 
borhood structures that in practice adapt well to the structure. Mathematically, 
the analysis of the algorithm remains the same when conditioning on the result 
of this first pre-selection. 

For the present example we dispose of a DCE-CT sequence of -R' = 53 recordings 
of 512 X 512-pixel images in the upper abdomen of a cancer patient. In Figure 
4 the original image at time step 23 is depicted together with the result of our 
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Figure 5: CT scan of the upper abdomen, residuals and zoom in denoised image 
with neighbourhood constructions around one voxel 

denoising procedure. The noise reduction is remarkable while fine structures like 
edges are well preserved and not smoothed out. The residuals in Figure 5 (left) 
show some artefacts due to human body movements and CT radial artefacts, 
which our procedure removed as well. In Figure 5(right) a zoom into Figure 
4(right) is shown together with the sequence of neighbourhoods constructed 
for one voxel inside the cancerogeneous tissue. The effect of the pre-selection 
step is clearly visible by the geometrically adaptive form of the neighbourhoods. 
Further results, in particular the denoised dynamics in certain voxels and an 
application to automatic clustering of cell tissues are reported in Rozenholc 
et al. (2009). The generality of our procedure has the potential to provide 
statistical solutions in many further applications where spatial inhomogeneity 
and robustness are key issues. 



7 Appendix 



7.1 Proof of Proposition 2.1 



The asymptotic normality of the sample median y^med(yi, . . . , 1^) =^ 
N{0, 1/(4/^(0))) is well known (van der Vaart 1998, Corollary 21.5) and implies 
by independence the first asymptotic result. 

Since the sample medians in the second case are not independent, we consider 
their joint distribution using empirical processes. Let us write i*A for the cumu- 
lative distribution function of /(• — A) and denote by B^, two independent 
standard Brownian bridges. Then empirical process theory yields by indepen- 
dence 

^ n ^ 2n 

nl - Vl([y„oo))-F,- V l{[Yi,^))-FA) ^{B'oF,B^oFA)Ag 

i=l i=n+l 
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The joint median med(Yi, i = 1, . . . , 2n) satisfies in terms of the empirical dis- 
tribution functions and F"^ of the two samples 

F"(med(yi, i = 1, . . . , 2n)) + FX(med(yi, i = 1, . . . , 2n)) = 1. 

Hence, it can be expressed as the functional {F^ + F'^~^{1) of (-F", -F^), assum- 
ing that the inverse is defined properly (e.g. giving the mean of all admissible 
values). Combining two-dimensional versions of Theorem 20.8 and Lemma 21.4 
of van der Vaart (1998), we infer 



n^med(li, i = 1, . . . , n), med(li, i = 1, . . . , 2n) — A/2 
^ ( - (i?i o F/f) o F~\l/2), -{B'oF + B^o Fa)/(/ + /(. - A)) o {F + F^yHl) 

By symmetry of / the right-hand side simplifies to 

- BHl/2)/f{0),-{BHF{A/2)) + B\Fa{ 



Consequently, y^(2(med(yj, i = 1, . . . ,2n) — med(yi, i = 1, . . . ,n)) — A) is 
asymptotically normal with mean zero and variance 

ai = 4E [( - {B\F{A/2)) + B\F{-A/2)))/{2f{A/2)) + i?i(l/2)//(0))'' 

^ / F(-A/2)(l-F(-A/2)) F(A/2)(l-F(A/2)) _J_ _ 1 - F{A/2) x 

I 4/2(A/2) ^ 4/2(A/2) ^4/2(0) 2/(0)/(A/2) 

_ 2F(A/2)(1 -F(A/2)) 1 2(1 - F(A/2)) 
- 'p{Aj2) ^JW)~ /(0)/(A/2) • 

While a\ = for A = is straight-forward, we rewrite a\ in terms of 
R = F/f to study the behaviour as A ^ 0: 

al = 2R{A/2)R{-A/2) + 4i?2(o) - 4i?(0)i?(-A/2). 

Because of -R(A/2) - R{-A/2) = A + 0(A2) by the Lipschitz property of /, 
we obtain asymptotically 



al = alr+2[[R{-A/2)-R{^)y+R{-A/2){R{A/2)-R{-A/2))) = a^+A+0(A2). 
This gives al = a^^{l + 2A/(0) + 0(A2/(0))). □ 



7.2 Proof of Proposition 4.6 



We shall only consider the case of odd = 2m + 1. Under the conditions of 
the proposition Brown et al. (2008) show the following result. 
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7.1 Theorem. For all m ^ the sample ei,...,e2m+i can be realised on 
the same probability space as a standard normal random variable Z such that 
med(e) := med(ei, i = 1, . . . , 2m + 1) satisfies 

medfe) ^ ^ ^ (l + zA if \Z\ ^ dV2m+l, 

v/4(2m+l)M0) "2m+lV ^ ^1 I- v 

where 5,C >0 are constants depending on fe, but independent of m. 

The construction and the inequahty of the theorem yield with some constant 
C" > 

E[|med(£)|2''l(|Z| ^ (5V2m + l)] 

r / \ 2r 

^ (2'-i V 1) E [(4(2m + l)/,(0)2)-n^|^^ + (l + Z^) 

< C'{2m + l)-^. 

On the other hand, because of G L'' we have for z ^ oo tliat tlic cdf satisfies 
Fe{—z) < \z\''^ 1 — Fs{z) < |^;|~''. Prom the formula for the density of 
med(£:) 

fm{z) = (^J^^/) {m + l)fe{z)F,{zr{l - F,{z)r 

we therefore infer that ||med(e) ||j;,3r for m ^ 2 is finite and uniformly bounded. 
Hence, the Holder inequality gives 

E[\ined{e)\^''l{\Z\ > 6V2m + l)] ^ E[|med(e)|='n^^^ > 6V2m + l)^/\ 

which by Gaussian tail estimates is of order cxp(— 5^(2m + l)/6) and thus for 
m ^ oo asymptotically negligible. This gives the upper moment bound for 
med(e), the lower bound follows symmetrically. The r-th moment is bounded 
by even simpler arguments. 

The second assertion follows via quantile coupling from 

P(V4(2m + l)/,(0)|med(e)| > 

^ ^ (1^1 + v£m^^ + > + IP (1^1 > ^V2^^) 

^ F{2\Z\ > r„) + F{\Z\ > 6V2m + 1) 
^2exp(-T2/8). 

□ 
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